Abstract

We study the performance of MPI checkerboard code for SU(3) lattice gauge theory as a function of the number of
MPI processes, which run in parallel on an identical number of CPU cores. The computing platforms explored are a
small PC cluster at FSU and the Cray at NERSC.
1. Introduction

In a previous paper [1], Fortran MPI checkerboard code for Markov Chain Monte Carlo (MCMC)
simulations of pure SU(3) Lattice Gauge Theory (LGT) with the Wilson action was introduced,
and a number of tests and verifications were provided. These programs allow for simulations
with periodic boundary conditions (PBC) as well as for the geometry of a double-layered torus
(DLT), which remains to be explored in more detail.

Here we extend this work and investigate the performance as a function of the number of CPU
cores used by an equal number of MPI processes. Tests were carried out on a cluster of two
high-end PCs with 16 cores in total at the High Energy Physics (HEP) group of Florida State
University (FSU) and on a Cray XT4 (named Franklin) with 38 640 cores at the National Energy
Research Scientific Computing Center (NERSC) [2] of the Lawrence Berkeley National Laboratory.
In the latter case we used up to 1 296 cores.

In the next section we report, for both PBC and DLT, the performance of our code as a function
of the number of CPU cores. This is followed by a summary and conclusions. The appendix
discusses annoying subtleties that we encountered with MPI send and receive instructions,
which made slightly modified versions b and c of the code necessary for the test runs of
this paper.



2. Scaling with the number of processors 

At the FSU HEP group we connected two Intel E5405 2 GHz quad-core PCs by a crossover cable
and installed Open MPI version 1.2.5-5. Up to 8 MPI processes can be matched by the number of
cores on one PC, up to 16 on both PCs. The Fortran compiler used was g77 based on gcc
version 3.4.6 (Red Hat 3.4.6-4).

The Cray XT4 at NERSC features a configuration of 38 640 cores of AMD Opteron 2.3 GHz
quad-core processors with a SeaStar2 switch interconnect. MPICH2 version 1.0.6p1 is installed,
using the Portland Group compiler version 8.0.1. On the Cray we have timed our code on up to
1 296 cores.

All CPU time measurements were done with the programs

cbsu3time2mpi{a,b,c}.f (1)

located in the folder ForProg of the program package STMC2LSU3MPI of Ref. [1]. These programs
perform nequi updating sweeps without measurements or read or write instructions. The value of
nequi is set in the parameter file mc.par. The different versions a, b, c are necessary to get
MPI send and receive instructions working for all sublattice choices; see our appendix for
details.

As there is no standardized Fortran time function, we rely on the Unix time command to measure
the execution time. The CPU time needed for initialization and creation of the start
configuration was measured separately (lswp false in mc.par) and subtracted when relevant.

2.1. PC cluster 

The runs for this section are set up in the

1TimeOpenMPI and 2TimeOpenMPI

folders of our STMC2LSU3MPI tree.

Table 1 compiles CPU time measurements from runs of our program (1) on our two PCs with 16
cores. As listed, lattice sizes are varied from 8^4 to 32^4. The left part of the table is for
PBC (nlat=1), the right part for DLT (nlat=2). Times for each lattice are normalized to the
number of sweeps given in the first row of the table, which is taken inversely proportional to
the lattice size. To get sufficiently accurate results, the actual numbers of sweeps were
occasionally multiples of those given in the first row. Final CPU time uncertainties are a few
percent.

Table 1

Execution times on our cluster of two PCs for symmetric lattices at β = 5.7 with mpifactor=1
and 2. The dimension of the MPI lattice is ndmpi=n and the number of MPI processes is np. The
left part of the table is for PBC (nlat=1) and the right part for DLT (nlat=2). For PBC, CPU
times from a non-MPI run are listed in the third row.

The first ndmpi directions of the lattice are partitioned into sublattices. For mpifactor=2
the sublattice extensions in these directions are half of those of the entire lattice. As
explained in [1], these sublattices form an MPI lattice of size

np = msmpi = nlat * mpifactor**ndmpi . (2)
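As a quick cross-check of the process count, the following Python sketch assumes that relation (2) reads np = nlat * mpifactor**ndmpi, which reproduces the core counts quoted elsewhere in this paper. The function names are illustrative and are not part of the STMC2LSU3MPI package.

```python
def mpi_processes(nlat, mpifactor, ndmpi):
    """Number of MPI processes, assuming np = nlat * mpifactor**ndmpi."""
    return nlat * mpifactor ** ndmpi


def sublattice_extents(extents, mpifactor, ndmpi):
    """Divide the first ndmpi lattice extents by mpifactor."""
    result = []
    for mu, n in enumerate(extents):
        if mu < ndmpi:
            if n % mpifactor != 0:
                raise ValueError("extent is not divisible by mpifactor")
            result.append(n // mpifactor)
        else:
            result.append(n)
    return tuple(result)


# PBC, 48^4 lattice, ndmpi=4, mpifactor=6
print(mpi_processes(nlat=1, mpifactor=6, ndmpi=4))
# DLT, 64^3 x 8 lattice, ndmpi=3, mpifactor=8
print(mpi_processes(nlat=2, mpifactor=8, ndmpi=3))
print(sublattice_extents((64, 64, 64, 8), mpifactor=8, ndmpi=3))
```

The two counts come out as 1 296 and 1 024, and the sublattice is 8^4 in the last case, in agreement with the Cray runs described in this paper.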

The number of MPI processes agrees with the num- 
ber of points on the MPI lattice and each process 
is picked up by one core unless the number of pro- 
cesses exceeds the number of available cores. 

For comparison we also performed 1-process CPU time measurements using (a) a non-MPI Fortran
program, indicated by a dash in the first two columns, and (b) our MPI code with mpifactor=1,
implying np=1. For the 1-process runs of our MPI program the parameter lbcex of latmpi.par
allows one to turn the boundary exchange off or on, where off is indicated by an F in the nd
column (for np≥2 boundary exchange always has to be turned on). CPU times show that the
slowdown due to MPI send and receive instructions for boundary exchange is less than 1% when
compared to the usual implementation [3] of PBC by pointer arrays.

In contrast to the PC used in [1], the non-MPI program is slightly slower than the MPI
program. For practical purposes the difference in CPU time consumption as well as in
performance is negligible. For np=1 to 4 we notice some loss of performance on the 32^4
system, which is likely due to inefficiencies of the Fortran compiler used. This disappears
when the lattice is partitioned into small enough sublattices.

Fig. 1. CPU time per SU(3) matrix update.

For PBC we use the CPU times in table 1 and those from the Cray in table 2 to calculate update
times per SU(3) matrix in units of single-core processor time, that is, the execution time
multiplied by the number of cores and divided by the number of SU(3) matrix updates performed.
In Fig. 1 we plot the update times in microseconds [μs] versus the number of cores used.
Efficient performance is found in a range from 4 to 7 microseconds, which is also typical for
other SU(3) lattice gauge theory code like that of the MILC collaboration [4], which is
written in C. The reason for our better 1-processor performance on the Cray than on the PCs
appears to be that the Portland Group compiler is installed on the Cray, whereas on the PCs we
had only the GNU compiler available.
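The normalization just described can be written out as a short Python sketch. It assumes that one sweep updates one SU(3) link matrix per site and direction on each of the nlat layers; this counting, the function name and the timing in the example are illustrative assumptions and are not taken from tables 1 or 2.

```python
def update_time_us(exec_seconds, cores, sweeps, extents, nlat=1):
    """Single-core time per SU(3) matrix update in microseconds."""
    sites = 1
    for n in extents:
        sites *= n
    ndim = len(extents)
    # Assumption: one link matrix per site and direction on each layer.
    updates = sweeps * nlat * ndim * sites
    return 1.0e6 * exec_seconds * cores / updates


# Hypothetical timing for a 16^4 lattice, chosen only to illustrate the units.
t = update_time_us(exec_seconds=16.384, cores=8, sweeps=100,
                   extents=(16, 16, 16, 16))
print(f"{t:.2f} microseconds per SU(3) update")
```

With these made-up inputs the result is about 5 microseconds, which lies inside the 4 to 7 microsecond range quoted above.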

On both the PCs and the Cray, the CPU time per SU(3) update stays almost constant up to 8
cores, followed by a loss in efficiency when 16 cores are employed. This loss is dramatic for
small lattices on our PCs. Up to 8 cores we stay on one PC, while for 16 cores communication
between the two PCs through the crossover cable is relatively slow. The surface-to-volume
ratio of the employed sublattice then matters. This ratio is best (smallest) for the largest
lattice. Compare 8 x 16^3/16^4 = 0.5 for the 32^4 lattice to 8 x 4^3/4^4 = 2 for the 8^4
lattice. Due to more efficient communication between nodes, the effect is far more moderate on
the Cray, although still visible.
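The two ratios quoted above follow from counting the 2 x 4 = 8 boundary faces of a symmetric 4D sublattice. A minimal Python sketch, with an illustrative function name, reproduces them:

```python
def surface_to_volume(extent, ndim=4):
    """Ratio of boundary sites to volume for a symmetric sublattice.

    There are 2*ndim faces, each with extent**(ndim-1) sites.
    """
    surface = 2 * ndim * extent ** (ndim - 1)
    volume = extent ** ndim
    return surface / volume


print(surface_to_volume(16))  # 16^4 sublattices of the 32^4 lattice: 0.5
print(surface_to_volume(4))   # 4^4 sublattices of the 8^4 lattice: 2.0
```

The ratio equals 2*ndim/extent, so halving the sublattice extent doubles the relative amount of boundary data to be exchanged.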

The results reported in the right part of table 1 are analogous to those of the left part. The
main difference is that we now run on a DLT (nlat=2). Scaling of CPU time with the number of
cores is similar to before. The limit of 16 cores is already reached for ndmpi=3.

Using a number of MPI processes that exceeds the number of cores is possible but inefficient.
Running an 8^4 lattice with 16 MPI processes on one PC with 8 cores needs 30% more CPU time
than running the same job with 8 MPI processes. Partitioning a 24^4 lattice with mpifactor=3
we found for ndmpi=1, i.e. 3 MPI processes, an improvement factor of 2.4 in real time compared
to the 1-process run. For a run with 9 MPI processes the further improvement factor was only
1.4 compared to the 3-process run. Note also that one should not execute other jobs in the
background, even with nice 19. Running 8 additional jobs with nice 19 on one PC took only 5%
of the CPU time according to the information provided by the top command. But due to the
resulting uneven load balancing, the execution time of an MPI job with 8 processes actually
went up by 35% in real time (while getting 95% of the CPU time).

2.2. Cray 



Table 2

Runs on the Cray analogous to those of table 1.

Examples of jobs for this section are set up in the

1CrayTime and 2CrayTime

folders of our STMC2LSU3MPI tree. They are not as easily reproducible as our previously
discussed runs, because a supercomputer is needed, which will rely on its particular job
control commands (those for the NERSC Cray are in the files q.run*, the output in *.out).

Fig. 2. Improvement factors in real time.

Table 2 compiles CPU time measurements on the Cray, which are analogous to those of table 1.
In Fig. 2 we use the results of tables 1 and 2 to plot the improvement factor in real time,
defined as

FCT = time(1)/time(np) , (3)

versus the number of cores used. Here time(1) is the CPU time needed without parallelization
and time(np) the CPU time per core for running with np processes on np cores. The real time
one has to wait for completion of a job is inversely proportional to FCT.

Up to 8 cores the improvement factor (3) is practically linear in the number of cores. For the
16^4 lattice the slope is 0.88 on the PCs as well as on the Cray, but in the range from 8 to
16 cores it is 0.75 on the Cray and down to 0.3 on the PCs. For the 32^4 lattice the slope is
above 0.95 for up to 8 cores on the PCs as well as on the Cray. Then, in the range from 8 to
16 cores it drops to 0.89 on the Cray and to 0.70 on the PCs. Parallelization beyond 8 cores
on the PCs makes no sense on the 8^4 lattice, for which the slope between 8 and 16 cores is
negative.
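The slopes for the 32^4 lattice can be cross-checked against the improvement factors quoted in the summary of this paper (7.88 and 13.5 on the PCs, 7.7 and 14.8 on the Cray, for 8 and 16 cores). The Python sketch below assumes that "slope" means the finite difference of FCT between two core counts, with FCT = 0 at zero cores for the first interval; the function name is illustrative.

```python
def fct_slope(np_lo, fct_lo, np_hi, fct_hi):
    """Finite-difference slope of the improvement factor FCT versus cores."""
    return (fct_hi - fct_lo) / (np_hi - np_lo)


# Improvement factors for the 32^4 lattice as quoted in the summary.
pc = {8: 7.88, 16: 13.5}
cray = {8: 7.7, 16: 14.8}

for name, fct in (("PCs", pc), ("Cray", cray)):
    first = fct_slope(0, 0.0, 8, fct[8])          # range up to 8 cores
    second = fct_slope(8, fct[8], 16, fct[16])    # range from 8 to 16 cores
    print(f"{name}: {first:.2f} up to 8 cores, {second:.2f} from 8 to 16 cores")
```

Under this reading the values come out close to 0.70 (PCs) and 0.89 (Cray) between 8 and 16 cores, and above 0.95 up to 8 cores, consistent with the numbers in the text.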

Ultimately, one wants to employ many cores for runs on large lattices. In table 3 we use 8^4
sublattices for parallelization of lattices with PBC up to size 48^4 (ndmpi=4 and mpifactor=6)
and for DLT lattices up to size 64^3 x 8 (ndmpi=3 and mpifactor=8).

Table 3

Execution times for 256 sweeps at β = 5.7 with PBC (ndmpi=4, left) and DLT (ndmpi=3, right),
nf=mpifactor in both cases and np the number of processes.

Fig. 3. Cray runs with up to 1 296 cores.

Figure 3 plots, as a function of the number of cores up to 1 296, the update times of SU(3)
matrices, given on the right ordinate, together with the improvement factors in real time,
given on the left ordinate. All update times stay below 7 μs. The log scale on the left hides
to some extent that the performance on the DLT is better than with PBC (note, however, that we
used ndmpi=3 on the DLT and ndmpi=4 with PBC). Measured in percentages of the peak speed
obtained in single-processor runs, one finds up to 432 cores an efficiency of about 83% for
the DLT. For PBC in the range of 256 to 625 cores it is around 70%. With 1 296 cores the
performance with PBC drops to 58%, while it is still 73% when simulating the DLT with 1 024
cores. In all cases there is a considerable gain in real time when increasing the number of
participating cores. Users are advised to tune ndmpi and mpifactor for optimal performance on
a particular supercomputer before running large-scale production, evaluating then also the
performance of measurement routines.
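The efficiencies at the largest core counts can be recovered from the improvement factors given in the summary (746 with 1 024 cores on the DLT, 749 with 1 296 cores with PBC). The sketch below assumes that efficiency is the improvement factor divided by the number of cores; the function name is illustrative.

```python
def parallel_efficiency(fct, cores):
    """Improvement factor in real time divided by the number of cores."""
    return fct / cores


dlt = parallel_efficiency(746, 1024)   # DLT, 64^3 x 8 lattice
pbc = parallel_efficiency(749, 1296)   # PBC, 48^4 lattice
print(f"DLT: {100 * dlt:.0f}%   PBC: {100 * pbc:.0f}%")
```

This gives about 73% for the DLT and 58% for PBC, matching the percentages stated above.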



3. Summary and Conclusions 

As computer configurations with large numbers of processors drop in price, while the peak
performance of single CPUs is almost stagnant, non-trivial parallel processing becomes more
important than ever. When running parallel applications, the user is first of all interested
in his or her gain in real (wait) time. So, let us conclude with examples from running our MPI
code. For our 32^4 lattice on a cluster of two PCs the reduction in real time is by a factor
1/7.88 when running on eight cores and by 1/13.5 when running on all 16 cores. The
corresponding factors on the Cray are 1/7.7 (down due to better single-core performance) and
1/14.8 (up due to better networking). Scaling 8^4 sublattices on a DLT up to size 64^3 x 8 and
using 1 024 cores on the Cray, the reduction in real time is by a factor 1/746. Scaling 8^4
sublattices to a 48^4 lattice with PBC and using 1 296 cores it was 1/749.
Appendix A. MPI Send and Receive Subtleties

The problems discussed in this appendix are related to allocating enough buffer space, so that
MPI can send and receive arrays of a requested size. In our 4D runs the size of the sublattice
boundary arrays, which MPI has to send and receive, is given by the size of the checkerboard
SU(3) matrix storage perpendicular to the 1-direction:

18 * mscb, mscb = nl2 * nl3 * nl4/2 . (A.1)

For our Open MPI implementation at FSU the maximum size of a Real*8 array that could be
transferred using basic MPI send and receive instructions turned out to be 503, too small even
to run our SU(3) MPI program with 4^4 sublattices. For the MPICH installation at the Institute
for Theoretical Physics of Leipzig University the number turned out to be 15 999, and on the
NERSC Cray with MPICH2 it was 16 384. While the latter numbers are sufficiently large to allow
for most applications, there are exceptions. For instance, only the last (ndmpi=4) of the 32^4
lattice runs of table 2 is possible.
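To see how the array size compares with these limits, the following Python sketch evaluates 18 * mscb for a given sublattice, assuming mscb = nl2 * nl3 * nl4/2 (the checkerboard half of the sites perpendicular to the 1-direction) and 18 real numbers per complex 3 x 3 matrix. The function name and the dictionary of limits are illustrative; the limits themselves are the values reported above.

```python
def boundary_array_size(nl2, nl3, nl4):
    """Number of Real*8 values in one boundary array, 18 * mscb."""
    mscb = nl2 * nl3 * nl4 // 2   # checkerboard sites perpendicular to direction 1
    return 18 * mscb              # 18 reals per complex 3x3 SU(3) matrix


# Maximum transferable Real*8 array sizes reported in the text.
limits = {"Open MPI (FSU)": 503,
          "MPICH (Leipzig)": 15999,
          "MPICH2 (NERSC Cray)": 16384}

for extent in (4, 8):
    size = boundary_array_size(extent, extent, extent)
    for platform, limit in limits.items():
        verdict = "fits" if size <= limit else "too large"
        print(f"{extent}^4 sublattice, {size} values, {platform}: {verdict}")
```

For 4^4 sublattices the array holds 576 values, which already exceeds the limit of 503 found for Open MPI at FSU, while 8^4 sublattices (4 608 values) stay below the two MPICH limits.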

We were unable to find documentation of these array size limits. When the program tries to
send an array larger than the allowed maximum size, one encounters a hangup without any error
messages. Therefore, before submitting an MCMC job in its version a, we recommend checking
that the array size (A.1) can really be transmitted. For this purpose the program

dsenda.f (A.2)

is kept in the MPICHtest and OpenMPItest subfolders of the 1MPICH and 1OpenMPI projects. In
this program we have set the array size parameter NDAT to the value for which the array
transmission works on our platforms, while it fails when increasing the initial NDAT value by
+1. Also included is a corresponding program, isenda.f, which tests integer arrays.

Before submitting a SU(3) MPI job: Change the 
NDAT parameter in the dsenda.f program to the 
array size, which you need, and confirm that the 
array transfer works on your MPI platform. If yes, 
you can use the a-version of our SU(3) programs. 
If the array transfer hangs up, you will need a solu- 
tion similar, but not necessarily identical, to those 
given in our b and c versions. 

For our Open MPI at FSU it was possible to over- 
come the buffer problem by modifications, which 
are given in the following. Instead of the main pro- 
gram listed in Ref. [1] the version 

cbsu3_dltb.f (A.3) 
(b for "buffer") has to be used. In the following four include statements

      include '../../Libs/MPISU3/cbsu3_bnd1a_mpia.f' ! Gather boundary.
      include '../../Libs/MPISU3/cbsu3_bnd1b_mpia.f' ! Gather boundary.
      include '../../Libs/MPISU3/cbsu3_bnd2a_mpia.f' ! Gather boundary.
      include '../../Libs/MPISU3/cbsu3_bnd2b_mpia.f' ! Gather boundary.

it replaces ...mpia.f by ...mpib.f. This exchanges the routines that perform the crucial MPI
send and receive instructions. The plain version is replaced by a routine using buffered send
and receive instructions. The parts of the two subroutines where the relevant differences lie
are listed in the following. In cbsu3_bnd1a_mpia.f it is

      subroutine cbsu3_bnd1a_mpi(idg,my_id) ! Bernd Berg Mar 14 2008.
C Collect boundaries without corners from checkerboard 1 for SU3 action.

C Send from checkerboard 1 to receive by checkerboard 2:
        irecv=ipf_mpi(idmpi,ismpi)-1 ! Send forward.
        call mpi_send(alsu3fb,mscb18,mpi_double_precision,
     &                irecv,itag,mpi_comm_world,ierr)
        i1=1+msc+(idmpi-1)*mscb ! Position in alsu3.
        isend=ipb_mpi(idmpi,ismpi)-1 ! Received from isend:
        call mpi_recv(alsu3(1,i1,idg),mscb18,mpi_double_precision,
     &                isend,itag,mpi_comm_world,istatus,ierr)
        irecv=ipb_mpi(idmpi,ismpi)-1 ! Send backward:
        call mpi_send(alsu3bb,mscb18,mpi_double_precision,
     &                irecv,itag,mpi_comm_world,ierr)
        i1=1+noffset+(idmpi-1)*mscb ! Position in alsu3.
        isend=ipf_mpi(idmpi,ismpi)-1 ! Received from isend:
        call mpi_recv(alsu3(1,i1,idg),mscb18,mpi_double_precision,
     &                isend,itag,mpi_comm_world,istatus,ierr)
      enddo
C
      return
      end

compared with cbsu3_bnd1a_mpib.f:

      subroutine cbsu3_bnd1a_mpi(idg,my_id) ! Bernd Berg Dec 9 2008.
C Collect boundaries without corners from checkerboard 1 for SU3 action.

      nbuff=mscb18

C Send from checkerboard 1 to receive by checkerboard 2:
        irecv=ipf_mpi(idmpi,ismpi)-1 ! Send forward.
        call mpi_buffer_attach(buffer,nbuff,ierr)
        call mpi_bsend(alsu3fb,mscb18,mpi_double_precision,
     &                 irecv,itag,mpi_comm_world,ierr)
        i1=1+msc+(idmpi-1)*mscb ! Position in alsu3.
        isend=ipb_mpi(idmpi,ismpi)-1 ! Received from isend:
        call mpi_recv(alsu3(1,i1,idg),mscb18,mpi_double_precision,
     &                isend,itag,mpi_comm_world,istatus,ierr)
        call mpi_buffer_detach(buffer,nbuff,ierr)
        irecv=ipb_mpi(idmpi,ismpi)-1 ! Send backward:
        call mpi_buffer_attach(buffer,mscb18,ierr)
        call mpi_bsend(alsu3bb,mscb18,mpi_double_precision,
     &                 irecv,itag,mpi_comm_world,ierr)
        i1=1+noffset+(idmpi-1)*mscb ! Position in alsu3.
        isend=ipf_mpi(idmpi,ismpi)-1 ! Received from isend:
        call mpi_recv(alsu3(1,i1,idg),mscb18,mpi_double_precision,
     &                isend,itag,mpi_comm_world,istatus,ierr)
        call mpi_buffer_detach(buffer,nbuff,ierr)
      enddo
C
      return
      end

In the b version

mpi_buffer_attach, mpi_buffer_detach

statements have been added, and mpi_send has been replaced by mpi_bsend. Note that the Fortran
names of the cbsu3_bnd*.f routines have been kept (in contrast to their filenames), so that
one only has to exchange the include statements in the main program.

Unfortunately, the buffered send and receive instructions of our b version are not universal
MPI code. They work neither with the MPICH installation at Leipzig University nor with MPICH2
on the NERSC Cray. Extended buffer sizes in our version c program,

cbsu3time2mpic.f , (A.4)

which is included in the Cray project folders of our package, performed well with MPICH2 on
the Cray, but failed with Open MPI at FSU and MPICH at Leipzig University. A unified MPI
standard appears to be missing.